Statistical Stemming for Kannada
Author
Abstract
Stemming is a process that groups morphologically related words into the same class and is widely used in information retrieval for improving recall rate. Here we study a set of statistical stemmers for Kannada, a resource-poor language with highly inflectional and agglutinative morphology. We compare stemming using simple truncation, clustering and an unsupervised morpheme segmentation algorithm on a sample from a text collection. We observe that a distance measure that rewards longest prefix matches is the best performing clustering-based stemmer. However, using a reasonably performing unsupervised morpheme segmentation seems to outperform the other stemming schemes considered.
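Two of the schemes named in the abstract, simple truncation and clustering with a distance that rewards long shared prefixes, can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the truncation length, or the clustering procedure, so everything below is an assumption made for illustration: the function names, the normalised prefix distance, the greedy single-pass clustering, the threshold value, and the romanized example words.

```python
from typing import List


def truncate_stem(word: str, n: int = 4) -> str:
    """Truncation stemmer: keep only the first n characters (n is an assumed parameter)."""
    return word[:n]


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return k


def prefix_distance(a: str, b: str) -> float:
    """Hypothetical distance in [0, 1]: the longer the shared prefix, the smaller the value."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - common_prefix_len(a, b) / longest


def cluster_words(words: List[str], threshold: float = 0.4) -> List[List[str]]:
    """Greedy clustering (an assumption, not the paper's procedure):
    a word joins the first cluster whose first member is within the threshold."""
    clusters: List[List[str]] = []
    for w in sorted(set(words)):
        for c in clusters:
            if prefix_distance(w, c[0]) <= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters


if __name__ == "__main__":
    # Romanized, illustrative word forms; real input would be Kannada-script tokens.
    sample = ["mane", "maneya", "manege", "mara"]
    print([truncate_stem(w) for w in sample])
    print(cluster_words(sample))
```

In this sketch each cluster plays the role of a stem class: all words in a cluster would be mapped to the same index term. Truncation is cheaper but cannot separate words that merely share their first few characters, which is one reason a prefix-sensitive distance with a tunable threshold may behave differently.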
Similar resources
NLP Challenges for Machine Translation from English to Indian Languages
This natural language processing work is carried out particularly on English-Kannada/Telugu. Kannada is a language of India, classified as Dravidian, Southern, Tamil-Kannada, Kannada. Regions spoken: Kannada is spoken in Karnataka, Andhra Pradesh, Tamil Nadu, and Maharashtra. Population: the total number of Kannada speakers was 35,346,000 as of 1997. A...
Development and standardization of Morningness-Eveningness Questionnaire (MEQ) in the Indian language Kannada.
INTRODUCTION A circadian rhythm is any biological process that displays an endogenous, entrainable, oscillation of about 24 hours; the rhythms driven by a circadian clock and sleep have been widely observed in plants, animals, fungi and cyanobacteria. The main aim of the current study was to translate and validate the Morningness-Eveningness Questionnaire (MEQ) to Kannada (MEQ-K). MATERIALS A...
A Maximum Entropy Approach to Kannada Part Of Speech Tagging
Part Of Speech (POS) tagging is the most important pre-processing step in almost all Natural Language Processing (NLP) applications. It is defined as the process of classifying each word in a text with its appropriate part of speech. In this paper, the probabilistic classifier technique of Maximum Entropy model is experimented for the tagging of Kannada sentences. Kannada language is agglutinat...
Adaptation of the Oswestry Disability Index to Kannada Language and Evaluation of Its Validity and Reliability.
STUDY DESIGN A translation, cross-cultural adaptation, and validation study. OBJECTIVE The aim of this study was to translate, adapt cross-culturally, and validate the Kannada version of the Oswestry Disability Index (ODI). SUMMARY OF BACKGROUND DATA Low back pain is recognized as an important public health problem. Self-administered condition-specific questionnaires are important tools for...
Named Entity Recognition and Classification in Kannada Language
Named Entity Recognition and Classification (NERC) is an essential and challenging task in Natural Language Processing (NLP). Kannada is a highly inflectional and agglutinating language, providing one of the richest and most challenging sets of linguistic and statistical features, resulting in long and complex word forms that are large in number. It is primarily a suffixing language, and an inflected word starts with a root an...